Code Placement using Temporal Profile Information - Gloy (1998) (2 citations) (Correct)

....with the goal of more efficiently using the instruction cache. Several compile time code placement techniques have 
been proposed that use heuristics and profile information to reduce the number of conflict misses in the primary 
(first-level or L1) instruction cache by reordering the program code [3, 18, 28, 27, 36]. Most of this work uses cache 
parameters such as cache size and line size as well as procedure sizes to accurately model the cache mapping of the 
code. The code placement algorithms typically use some kind of profile information to find a cache mapping that reduces 
cache conflict misses. These .... 
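
The cache mapping mentioned in the excerpt above can be illustrated with a minimal sketch. It assumes a direct-mapped cache and uses hypothetical parameters (8 KB cache, 32-byte lines) and hypothetical procedure addresses; it is not the model used by any of the cited papers.

```python
def cache_lines(start_addr, size, cache_size=8192, line_size=32):
    """Return the direct-mapped cache line indices a procedure occupies."""
    num_lines = cache_size // line_size
    first_block = start_addr // line_size
    last_block = (start_addr + size - 1) // line_size
    return {block % num_lines for block in range(first_block, last_block + 1)}


def conflicting_lines(proc_a, proc_b, **cache):
    """Cache lines that two (start_addr, size) procedures both map to."""
    return cache_lines(*proc_a, **cache) & cache_lines(*proc_b, **cache)


# Hypothetical layout: a and b sit exactly one cache size apart, so they
# compete for the same lines; c is shifted by 256 bytes and does not.
a = (0, 256)
b = (8192, 256)
c = (8192 + 256, 256)

print(len(conflicting_lines(a, b)))  # 8
print(len(conflicting_lines(a, c)))  # 0
```

Overlapping lines only turn into conflict misses when the two procedures are executed close together in time, which is why placement algorithms combine such a mapping with profile information.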

....use a WCG, which contains only information about pairs of procedures connected by a direct call and no information 
about the temporal ordering of procedure calls. As an example, Figure 2 shows two programs that result in the same WCG 
but have substantially different temporal behavior. McFarling [26] uses profile data that incorporates loop counts and 
probabilities for conditionals, but still retains the limitations mentioned above. Basic block transitions, used by 
Torrellas et al. [36], share these limitations. Our technique is based on a profiling scheme that captures important 
information .... 
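
The limitation described in the excerpt above can be shown with a small sketch. The two call traces are hypothetical (they are not the programs of Figure 2), and `callee_switches` is an illustrative measure of temporal interleaving, not the profiling scheme of the cited paper.

```python
from collections import Counter


def weighted_call_graph(call_trace):
    """Count how often each (caller, callee) pair occurs in a call trace."""
    return Counter(call_trace)


def callee_switches(call_trace):
    """Count how often consecutive calls target different callees."""
    callees = [callee for _, callee in call_trace]
    return sum(1 for prev, cur in zip(callees, callees[1:]) if prev != cur)


# Hypothetical traces: main calls f and g four times each in both.
alternating = [("main", "f"), ("main", "g")] * 4
phased = [("main", "f")] * 4 + [("main", "g")] * 4

print(weighted_call_graph(alternating) == weighted_call_graph(phased))  # True
print(callee_switches(alternating), callee_switches(phased))  # 7 1
```

Both traces yield the same edge weights, yet f and g interleave only in the first one, so only there would mapping them to the same cache lines cause repeated conflicts.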

....granularity. Pettis and Hansen [20], Scott McFarling [10], Hatfield and Gerald [11], and Gloy and Smith [15] have presented 
methods to rearrange the procedures that comprise the executable, based on profile data, to improve memory 
locality. Most of these use profile data in the form of a weighted call graph (WCG). In a WCG, there is a node for each .... 

....the working set size and cache conflicts. There has been prior work in changing code layout at the function level at 
compile time as well as dynamically. There have also been efforts to exploit program locality dynamically at other levels of 
granularity. Pettis and Hansen [21], Scott McFarling [18], Hatfield and Gerald [13], and Gloy and Smith [15] have presented 
methods to rearrange the procedures that comprise the executable, based on profile data, to improve memory 
locality. Most of these use profile data in the form of a weighted call graph (WCG). In a WCG, there is a node for each .... 

....in which blocks b and d are packed into the same cache line. The number of cache misses is greatly reduced in this 
case. Here we have illustrated the application of code layout optimization at the basic block level. Techniques for layout 
optimization at procedural level have also been developed [29]. 5. CONCLUDING REMARKS In this chapter we have 
identified optimization opportunities that may exist during program execution but cannot be exploited without the availability 
of profile data. Different types of profile data that are useful for code optimization were identified. The use of this profile .... 

Software-assisted Cache Replacement Mechanisms for.. - Jain, Devadas, Engels, .. (2001) (2 citations) (Correct) 

....a varied set of methods for automatic cache control instruction insertion. 2.2 Memory Exploration in Embedded Systems 
Cache memory issues have been studied in the context of embedded systems. McFarling presents techniques of code 
placement in main memory to maximize instruction cache hit ratio [10, 14]. A model for partitioning an instruction 
cache among multiple processes has been presented [7]. Panda, Dutt and Nicolau present techniques for partitioning 
on-chip memory into scratchpad memory and cache [12]. The presented algorithm assumes a fixed amount of scratchpad 
memory and a fixed size .... 

S. McFarling. Program Optimization for Instruction Caches. In Proceedings of the 3rd Int'l Conference on Architectural 
Support for Programming Languages and Operating Systems, pages 183-191, April 1989. 

... OF THE CACHE AREA AND LATENCY MODEL: A AREA, L LATENCY [13] conflict misses in large direct-mapped 
instruction caches has been proposed [3]. Static code repositioning by using cache line coloring at the procedure 
or basic block level has been an alternative approach proposed and evaluated in [12], [13] and [27]. A similar 
technique for profile-driven data repositioning has been proposed in [26]. III. PERFORMANCE MODELING In this 
section, we describe the hardware performance models for caches and processor cores. Three factors combine to 
influence system performance: cache miss rates, .... 

S. McFarling, "Program optimization for instruction caches," in Proc. Int. Conf. Architectural Support for Programming 
Languages and Operating Systems, 1989, pp. 183-191. 

....this work is to extend their work to static profiling. The advantages of the availability of static profiling are quite obvious. 
Many parts of optimizing compilers rely on profile data to perform good optimizations, for example trace 
scheduling [Fis81], register allocation [Wal86] and code motion [McF89]. In general these optimizations take 
advantage of locating those 10% of the program code in which 90% of the run time is spent. Up to now it is common practice to 
get profile information by running a program and measuring the interesting data (e.g. block counts) by an appropriate tool 
like prof, .... 

Scott McFarling. Program optimization for instruction caches. In Third International Symposium on Architectural Support for 
Programming Languages and Operating Systems, April 1989. Published as Computer Architecture News 17(2). 



Code Layout Optimizations for Transaction Processing.. - Ramirez.. (2001) (2 citations) (Correct) 

....TPC-C benchmarks on AlphaServer. 6 Discussion and Related Work Code layout optimizations were originally 
proposed to reduce the working set size of applications for virtual memory [8, 11, 10]. More recent work has focused on the 
reduction of branch mispredicts and cache misses. McFarling [18] describes an algorithm that uses the loop and call 
structure of a program to determine which parts of the program should overlay each other in the cache and which 
parts should be assigned to non-conflicting addresses. Hwu and Chang [13] describe a profile-based algorithm which 
uses function .... 

S. McFarling. Program optimization for instruction caches. Proceedings of the 3rd Intl. Conference on Architectural Support 
for Programming Languages and Operating Systems, pages 183-191, Apr. 1989. 



Automated Design of Finite State Machine Predictors for.. - Sherwood, Calder (2001) (Correct) 

....FSM predictors are used in a few areas of computer architecture, and summarize initial results for using automated FSM 
predictors to guide confidence estimation used to guide value prediction. 6.1 Cache Management Cache management 
schemes have been proposed that perform intelligent replacement [16] and cache exclusion [29], and they use a small 
FSM counter to determine when the optimization should be applied. In addition, prefetching architectures have used 
FSM predictors to determine when to initiate prefetching for a load and to guide stream buffer allocation [25]. 6.2 Power 
Control Manne .... 
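
The "small FSM counter" mentioned in the excerpt above is commonly realized as a saturating counter. The following is a minimal sketch under that assumption; the class name, the 2-bit width, and the threshold are illustrative choices, not details taken from the cited papers.

```python
class SaturatingCounter:
    """An n-bit up/down counter that saturates at 0 and 2**bits - 1."""

    def __init__(self, bits=2):
        self.max_value = (1 << bits) - 1
        self.value = 0

    def update(self, event_occurred):
        """Move one step up on a positive outcome, one step down otherwise."""
        if event_occurred:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)

    def should_apply(self):
        """Apply the optimization only in the upper half of the states."""
        return self.value > self.max_value // 2


counter = SaturatingCounter()
for outcome in [True, True, False, True]:
    counter.update(outcome)

print(counter.value, counter.should_apply())  # 2 True
```

Because the counter needs two consecutive contrary outcomes to cross the threshold from a saturated state, a single anomalous event does not flip the decision.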

S. McFarling. Program optimization for instruction caches. In Proceedings of the Third International Conference on 

....of the mispredict recovery time. These two factors have long been identified as fetch performance bottlenecks and are 
relatively well researched topics. The design of instruction caches has been studied in great detail in order to lessen 
the impact of instruction cache misses on fetch bandwidth [31], [36]. Likewise, there have been many studies done 
to improve branch prediction accuracy [16], [25]. To date, the techniques developed to reduce instruction cache 
misses and increase branch prediction accuracy have been very successful in improving fetch bandwidth. 
However, as the issue rates for .... 

S. McFarling, "Program Optimization for Instruction Caches," Proceedings of the Third International Conference on 
Architectural Support for Programming Languages and Operating Systems, April 1989. 

....space to a small region in the cache (useful for memory mapped devices). 4.2 Memory Exploration in Embedded Systems 
Cache memory issues have been studied in the context of embedded systems. McFarling presents techniques of code 
placement in main memory to maximize instruction cache hit ratio [8, 16]. A model for partitioning an instruction cache 
among multiple processes has been presented [6]. Panda, Dutt and Nicolau present techniques for partitioning on-chip 
memory into scratchpad memory and cache [11]. The presented algorithm assumes a fixed amount of scratchpad memory 
and a fixed size .... 



S. McFarling. Program Optimization for Instruction Caches. In Proceedings of the 3rd Int'l Conference on Architectural 
Support for Programming Languages and Operating Systems, pages 183-191, April 1989. 

....cache is explored. 2. If an outer loop does not fit in the cache there is no reuse from one iteration of this loop to the 
next. In this case, innermost loops still may exhibit reuses. 3. There is no instruction cache interference. This problem can 
be addressed independently with methods [13, 12, 2S] to lay out the code such that no or few interferences occur. 
We only consider the first level instruction cache. 4. Data cache misses are assumed invariant by loop unrolling. We could 
verify this assumption in our experiments. 5. It is assumed that the compiler generates the code for .... 

S. McFarling. Program optimization for instruction caches. In Proceedings of the Third International Conference on 
Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 183-191, April 1989. 

....counts, is the most important, especially for integer programs. The second use, delaying register reuse, is more 
important for floating point programs where scheduling for long operation latency is important. 3.7 Code layout Our code 
layout algorithm is essentially the same as Pettis and Hansen [4,5,6,9]. Its goal is to reduce instruction cache misses 
and improve instruction fetch by using profile information to guide the layout of code in memory. We found that the 
algorithm worked well, except in its handling of branches for .... 
[Figure 4: Function from xlisp where rarely called is useful] 

S. McFarling. "Program Optimization for Instruction Caches," ASPLOS III Proceedings, Boston, Mass. (April 1989), 183-193. 

....cloned program, something that currently consumes an inordinate amount of compile time. When we apply the profiling 
based code placement to larger programs, we will no doubt see its effectiveness diminished, but fortunately, 
several more sophisticated algorithms can be found in the literature [8, 7, 6]. We would also like to find a better 
heuristic method, since profile guided methods are less convenient to use. Also, we have not yet looked at aligning 
procedures at cache line boundaries which might further decrease the miss rate 

Scott McFarling. Program optimization for instruction caches. In Proceedings of the Third International Conference on 
Architectural Support for Programming Languages and Operating Systems, pages 183-191, 1989. 

....keep track of dynamic branch information. A two level branch predictor proposed in [YePa91] uses two levels of branch 
history to predict branch direction. Hybrid branch predictors composed of several single scheme predictors and a 
way to select one of them at a particular time have been proposed [McFa89, ChHP95]. Instruction prefetching has 
been addressed in the past primarily through sequential prefetch or code layout techniques [Smit82, DEC82, SmHs92, 
HwCh89, McFa89, Joup90, ERPR95, UNMS95, XiTo96, LBCG95, Intel93]. Sometimes instruction prefetch was initiated 
along both possible branch paths .... 

... branch direction. Hybrid branch predictors composed of several single scheme predictors and a way to select one 
of them at a particular time have been proposed [McFa89, ChHP95]. Instruction prefetching has been addressed in 
the past primarily through sequential prefetch or code layout techniques [Smit82, DEC82, SmHs92, HwCh89, 
McFa89, Joup90, ERPR95, UNMS95, XiTo96, LBCG95, Intel93]. Sometimes instruction prefetch was initiated along both 
possible branch paths [Intel93]. Compiler assistance can help by code layout or by identifying the end of a basic block to 
stop prefetching [HwCh89, McFa89, XiTo96]. The main improvement comes from adding a sequential prefetcher as has 
been .... 

Microarchitectural and Compile Time Optimizations for.. - Kalamatianos (2000) (1 citation) (Correct) 

.... to store the conflicting code module in the cache [49], (iii) implement a mapping function in hardware so that 
fewer code modules map to the same cache location [50, 51] and (iv) reorder the code modules in the main 
memory address space at compile time so that fewer conflicts may occur at run time [52, 53, 54]. We pursue the last 
method. We first study the temporal interaction among procedures since accurate temporal information has not been used 
in the context of code reordering until recently [55]. We then attempt to improve code spatial locality and instruction fetch 
efficiency with intraprocedural .... 

....block globally. Branch alignment is a form of basic block positioning technique that attempts to minimize the 
effects of branch mispredictions and misfetches [69, 70, 71]. Most other related work on basic block reordering 
has targeted improving fetch unit effectiveness and memory access time [53, 34, 66, 72, 87]. The main idea behind 
all these strategies is to rearrange code units so that conflicts between them at different levels of the memory hierarchy (1st 
and 2nd level caches, main memory) are reduced. In addition, the new ordering of code should improve spatial locality and 
cache utilization. We .... 